Abstract

Feature selection is a pattern recognition approach for choosing the
variables that, according to some criterion, best distinguish or explain a
given phenomenon. Many genomic and proteomic applications rely on feature
selection to answer questions such as selecting signature genes that are
informative about some biological state (e.g. normal tissues and several
types of cancer), or defining a network of prediction or inference among
elements such as genes, proteins, external stimuli and other elements of
interest. In these applications, a recurrent problem is the lack of samples
needed to adequately estimate the joint probabilities between element
states. A myriad of feature selection algorithms and criterion functions
have been proposed, although it is difficult to point out a best solution in
general. The intent of this work is to provide an open-source, multiplatform
graphical environment to apply, test and compare many feature selection
approaches suitable for bioinformatics problems.



1 Introduction 

Pattern recognition methods allow the classification of objects or patterns
into a number of classes [1]. Specifically, in statistical pattern
recognition, given a set $Y = \{y_1, \ldots, y_c\}$ of classes and an
unknown pattern $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$, a pattern
recognition system associates $\mathbf{x}$ with a class $y_i$ based on
measures defined in a feature space. In many applications, especially in
bioinformatics, the dimension of the feature space tends to be very large,
which makes the classification task difficult. To overcome this, the study
of dimensionality reduction in pattern recognition becomes imperative.

The so-called "curse of dimensionality" [2] is a phenomenon in which the
number of training samples required for satisfactory classifier performance
grows exponentially with the dimension of the feature space. This is the
main reason why dimensionality reduction is important in problems with a
large number of features and a small number of training samples. Many
bioinformatics applications fit this context exactly. Data sets containing
mRNA transcription expression from microarray or SAGE, for example, have
thousands of genes (features) and only a few dozen samples, which may be
cell states or types of tissue. If time is a factor, the samples are called
dynamical states; otherwise they are called steady states.

There are basically two dimensionality reduction approaches: feature 
extraction and feature selection [TJ |3j S]. The feature extraction methods 
create new features from transformations or combinations of the original fea- 
ture set. On the other hand, feature selection algorithms just search for the 
optimal feature subset according to some criterion function. The software 
proposed in this paper is initially focused on feature selection methods. 

A feature selection method is composed of two main parts: a search
algorithm and a criterion function. Search algorithms fall into two main
categories: optimal and sub-optimal. The optimal algorithms (including
exhaustive and branch-and-bound searches) return the best feature subspace,
but their computational cost is too high for them to be applied in general.
The sub-optimal algorithms do not guarantee that the solution is optimal,
but some of them offer a reasonable trade-off between computational cost
and quality of the solution. So far, we have implemented in the software
the exhaustive search (optimal), the Sequential Forward Selection (SFS,
sub-optimal) and the Sequential Forward Floating Selection (SFFS,
sub-optimal with an excellent trade-off) [5].

A large number of criterion functions have been proposed in the literature.
The most common ones are based on the classifier error and on distances
between patterns. There are also criterion functions based on information
theory. These are closely related to the classifier error but, instead of
using the error itself, they are based on the conditional entropy of the
class probability distribution given the observed pattern.

Due to the curse of dimensionality, error estimation is a crucial issue. We
have developed some ways to embed error estimation in the criterion
functions based on classifier error or conditional entropy. The main idea
is to penalize non-observed or rarely observed instances. An advantage of
doing this is that the right dimension of the feature subset solution is
also estimated (the dimension does not have to be supplied as a parameter).
After feature selection, it is possible to apply classical error estimation
techniques such as resubstitution, leave-one-out, cross-validation or
bootstrap.

The software is implemented in Java, so it can be executed on many
operating systems. It is open source and intended to be continuously
developed in a world-wide collaboration. The software is available at
http://dimreduction.incubadora.fapesp.br/.

Following this introduction, Sections 2 and 3 describe the feature
selection algorithms and criterion functions implemented so far. Section 4
discusses the implemented software. Section 5 shows some preliminary
results obtained on gene regulation networks and on the classification of
breast cancer cells. The paper ends with some conclusions in Section 6.

2 Implemented feature selection algorithms 

The first and simplest feature selection algorithm implemented in this work
is the exhaustive search. This algorithm searches the whole search space
and, as a result, the selected features are optimal. However, in the
bioinformatics context the computational cost normally makes this approach
inadequate. There is clearly a trade-off between optimality and
computational cost.

An alternative is to adopt sub-optimal search methods. In this work we
considered two sub-optimal approaches that return a single solution, known
as bottom-up and top-down. In the first one, the selected subset starts
empty and features are inserted so as to optimize a criterion function
until a stop condition is satisfied, which is often based on the subset
size or on a threshold. In the second one, the subset starts full and
features are removed, again trying to optimize the criterion function,
until a stop condition is reached. Methods that implement these approaches
are known as SFS (Sequential Forward Selection) and SBS (Sequential
Backward Selection), respectively. Considering the context of this work,
our choice was to implement the SFS approach.
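The bottom-up search can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of the software: the names `sfs`, `criterion` and `max_size` are hypothetical, the stop condition is a fixed subset size, and larger criterion values are assumed to be better (for a criterion that should be minimized, such as an entropy, its negative can be passed).

```python
from typing import Callable, List, Sequence


def sfs(n_features: int,
        criterion: Callable[[Sequence[int]], float],
        max_size: int) -> List[int]:
    """Sequential Forward Selection over feature indices 0..n_features-1."""
    selected: List[int] = []
    while len(selected) < max_size:
        remaining = [f for f in range(n_features) if f not in selected]
        if not remaining:
            break
        # Insert the feature whose addition gives the best criterion value.
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
    return selected


# Toy criterion (an assumption for the example): sum of per-feature weights.
weights = [0.1, 0.9, 0.5, 0.3]
print(sfs(4, lambda subset: sum(weights[f] for f in subset), 2))
```

Once a feature has been inserted it is never reconsidered, so each call to `criterion` only evaluates one-feature extensions of the current subset.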

However, these sub-optimal search methods have an undesirable drawback
known as the nesting effect. This effect happens because features discarded
in the top-down approach are never inserted again, and features inserted in
the bottom-up approach are never discarded.

To circumvent this problem, the Sequential Forward Floating Selection
(SFFS) [5] was also implemented. The SFFS algorithm tries to avoid the
nesting effect by allowing features to be inserted into and excluded from
the subset in a floating way, i.e. without fixing the number of insertions
or exclusions in advance.

The SFFS may be formalized as in [5]. Let $X_k = \{x_i : 1 \le i \le k,\
x_i \in X\}$ be the subset with $k$ features of the complete set
$X = \{x_i : 1 \le i \le n\}$ with $n$ features available. Let $E(X_k)$ be
the criterion function value for the subset $X_k$. The algorithm
initializes with $k = 0$, therefore the subset $X_k$ is empty.

First Step (insert): using the SFS method, select the feature $x_{k+1}$ of
the set $X - X_k$ to form the set $X_{k+1}$, such that $x_{k+1}$ is the
most relevant feature with respect to the subset $X_k$. The new subset is
$X_{k+1} = X_k \cup x_{k+1}$.

Second Step (conditional exclusion): Find the least relevant feature in
the set $X_{k+1}$. If $x_{k+1}$ is the least relevant feature in the subset
$X_{k+1}$, then $k \leftarrow k + 1$, $X_k \leftarrow X_{k+1}$ and go back
to the first step. If $x_r$, $1 \le r \le k$, is the least relevant feature
in the subset $X_{k+1}$, then exclude $x_r$ from $X_{k+1}$ to form a new
subset $X'_k = X_{k+1} - x_r$ and $k \leftarrow k - 1$. If $k = 2$, then
$X_k = X'_k$ and return to the first step, else execute the third step.

Third Step (continuation of conditional exclusion): Find the least relevant
feature $x_s$ in the set $X'_k$. If $E(X'_k - x_s) \le E(X_{k-1})$, then
$X_k \leftarrow X'_k$ and return to the first step. If
$E(X'_k - x_s) > E(X_{k-1})$, then exclude $x_s$ from $X'_k$ to form a new
reduced subset $X'_{k-1} = X'_k - x_s$ and $k \leftarrow k - 1$. If
$k = 2$, then $X_k = X'_k$ and return to the first step, else repeat the
third step.

The SFFS algorithm starts by setting $k = 0$ and $X_k = \emptyset$, and the
SFS method is used until the subset size reaches $k = 2$. Then SBS is
performed in order to exclude bad features. SFFS proceeds by alternating
between SFS and SBS until a stopping criterion is reached. The best subset
found for each cardinality is stored in a list. The best set among them is
selected as the result of the algorithm and, if a tie occurs, the set with
the lowest cardinality is selected.
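The alternation between insertion and conditional exclusion can be sketched as follows. This is a simplified reading of the floating search, not the software's implementation: the names `sffs`, `criterion` and `max_size` are hypothetical, larger criterion values are assumed to be better, the stopping criterion is a maximum subset size, and a feature is excluded only when the reduced subset strictly improves on the best subset of that cardinality found so far.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def sffs(n_features: int,
         criterion: Callable[[Sequence[int]], float],
         max_size: int) -> Tuple[List[int], float]:
    """Sequential Forward Floating Selection (simplified sketch)."""
    best: Dict[int, Tuple[float, List[int]]] = {}  # cardinality -> (value, subset)
    current: List[int] = []
    while len(current) < max_size:
        remaining = [f for f in range(n_features) if f not in current]
        if not remaining:
            break
        # Insertion (SFS step): add the most relevant remaining feature.
        added = max(remaining, key=lambda f: criterion(current + [f]))
        current = current + [added]
        value = criterion(current)
        if len(current) not in best or value > best[len(current)][0]:
            best[len(current)] = (value, list(current))
        # Conditional exclusion (SBS steps), only once the subset exceeds size 2.
        while len(current) > 2:
            # The least relevant feature is the one whose removal hurts least.
            worst = max(current,
                        key=lambda f: criterion([g for g in current if g != f]))
            reduced = [g for g in current if g != worst]
            reduced_value = criterion(reduced)
            if reduced_value > best[len(reduced)][0]:
                current = reduced
                best[len(reduced)] = (reduced_value, list(reduced))
            else:
                break
    if not best:
        return [], float("-inf")
    # Best value over all cardinalities; ties go to the lower cardinality.
    k = max(best, key=lambda size: (best[size][0], -size))
    return best[k][1], best[k][0]


# Toy criterion table (an assumption for the example): features 1 and 2 are
# weak alone but strong together, which plain forward selection misses.
table = {
    frozenset({0}): 0.5, frozenset({1}): 0.4, frozenset({2}): 0.3,
    frozenset({0, 1}): 0.6, frozenset({0, 2}): 0.6, frozenset({1, 2}): 0.9,
    frozenset({0, 1, 2}): 0.9,
}
subset, value = sffs(3, lambda s: table[frozenset(s)], 3)
print(sorted(subset), value)
```

In this toy case a purely forward search would keep feature 0 after its first insertion, whereas the floating search drops it once the pair {1, 2} turns out to be better; the tie with the full set is resolved in favor of the smaller subset.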

3 Implemented criterion functions 

We implemented criterion functions based on classifier information (mean 
conditional entropy) and classifier error (Coefficient of Determination [6]), 
introducing some penalization on poorly or non-observed patterns. 

3.1 Mean conditional entropy 

Information theory originated with Shannon [7] and can be employed in
feature selection problems [3]. Shannon's entropy $H$ is a measure of the
randomness of a variable $Y$, given by:

$$H(Y) = -\sum_{y \in Y} P(y) \log P(y), \tag{1}$$

where $P$ is the probability distribution function. By convention,
$0 \cdot \log 0 = 0$.

The conditional entropy is a fundamental concept related to the mutual
information. It is given by the following equation:

$$H(Y|\mathbf{X} = \mathbf{x}) = -\sum_{y \in Y} P(y|\mathbf{X} = \mathbf{x}) \log P(y|\mathbf{X} = \mathbf{x}), \tag{2}$$

where $\mathbf{X}$ is a feature vector and $P(Y|\mathbf{X} = \mathbf{x})$
is the conditional probability of $Y$ given the observation of an instance
$\mathbf{x} \in \mathbf{X}$. Finally, the mean conditional entropy of $Y$
given all the possible instances $\mathbf{x} \in \mathbf{X}$ is given by:

$$H(Y|\mathbf{X}) = \sum_{\mathbf{x} \in \mathbf{X}} P(\mathbf{x}) H(Y|\mathbf{x}). \tag{3}$$

Lower values of $H$ yield better feature subspaces (the lower $H$, the
larger the information gained about $Y$ by observing $\mathbf{X}$).
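As an illustration, the mean conditional entropy can be estimated from a list of quantized samples by replacing probabilities with relative frequencies. The sketch below is only an example under assumptions: the function names are hypothetical, logarithms are taken in base 2 (the text does not fix a base), and each sample is a tuple of discrete feature values paired with a class label.

```python
from collections import Counter, defaultdict
from math import log2
from typing import Hashable, Iterable, Sequence


def entropy(counts: Iterable[int]) -> float:
    """Shannon entropy (base 2) of a distribution given by raw counts."""
    counts = [c for c in counts if c > 0]  # convention: 0 * log 0 = 0
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)


def mean_conditional_entropy(xs: Sequence[Sequence[Hashable]],
                             ys: Sequence[Hashable]) -> float:
    """H(Y|X) estimated from samples: sum over x of P(x) * H(Y|X = x)."""
    by_instance = defaultdict(Counter)
    for x, y in zip(xs, ys):
        by_instance[tuple(x)][y] += 1
    s = len(ys)
    return sum(sum(c.values()) / s * entropy(c.values())
               for c in by_instance.values())


# Toy data (an assumption for the example): Y is the XOR of two binary features.
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [0, 1, 1, 0]
print(mean_conditional_entropy(xs, ys))                     # both features
print(mean_conditional_entropy([(x[0],) for x in xs], ys))  # first feature only
```

With both features every instance determines the class, so the estimate is zero; with the first feature alone each instance leaves the class fully uncertain, which gives one bit.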

3.2 Coefficient of Determination 

The Coefficient of Determination (CoD) [6], like the conditional entropy,
is a non-linear criterion useful for feature selection problems [8]. It is
given by:

$$CoD_Y(\mathbf{X}) = \frac{1 - \max_{y \in Y} P(y) - \left(1 - \sum_{\mathbf{x} \in \mathbf{X}} P(\mathbf{x}) \max_{y \in Y} P(y|\mathbf{x})\right)}{1 - \max_{y \in Y} P(y)}, \tag{4}$$

where $1 - \max_{y \in Y} P(y)$ is the error of predicting $Y$ in the
absence of other observations (let us denote it by $\varepsilon_Y$) and
$1 - \sum_{\mathbf{x} \in \mathbf{X}} \max_{y \in Y} P(\mathbf{x}, y)$ is
the error of predicting $Y$ based on the observation of $\mathbf{X}$ (let
us denote it by $\varepsilon_Y(\mathbf{X})$). Larger values of CoD yield
better feature subspaces (CoD = 0 means that the feature subspace does not
improve the prior error and CoD = 1 means that the error was fully
eliminated).
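The CoD can be estimated from samples in the same way, using relative frequencies for the probabilities. The following sketch is illustrative only; the function name `cod` and the toy data are assumptions, and the function refuses to compute a value when the prior error is zero, since the ratio is then undefined.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def cod(xs: Sequence[Sequence[Hashable]], ys: Sequence[Hashable]) -> float:
    """Coefficient of Determination of Y given X, estimated from samples."""
    s = len(ys)
    # Error of predicting Y with no observation: 1 - max_y P(y).
    prior_error = 1 - max(Counter(ys).values()) / s
    if prior_error == 0:
        raise ValueError("CoD is undefined when only one class is observed")
    by_instance = defaultdict(Counter)
    for x, y in zip(xs, ys):
        by_instance[tuple(x)][y] += 1
    # sum_x max_y P(x, y), with probabilities taken as relative frequencies.
    hit_rate = sum(max(c.values()) for c in by_instance.values()) / s
    error = 1 - hit_rate
    return (prior_error - error) / prior_error


# Toy data (an assumption for the example): Y is the XOR of two binary features.
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [0, 1, 1, 0]
print(cod(xs, ys))                     # both features
print(cod([(x[0],) for x in xs], ys))  # first feature only
```

On this toy data the two features together remove the prior error entirely, while the first feature alone leaves it unchanged.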



3.3 Penalization of non-observed instances 

One way to embed the estimation error caused by using feature vectors with
large dimensions and an insufficient number of samples is to involve
non-observed instances in the calculation of the criterion value [9]. A
positive probability mass is attributed to the non-observed instances, and
their contribution is the same as that of observing only the $Y$ values
with no other observations.

In the case of the mean conditional entropy, the non-observed instances get
an entropy equal to $H(Y)$ and, for the CoD, they get the prior error
$\varepsilon_Y$. The probability mass for the non-observed instances is
parametrized by $\alpha$. This parameter is added to the frequency (number
of occurrences) of every possible instance. So, the mean conditional
entropy with this type of penalization becomes:



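$$H(Y|\mathbf{X}) = \frac{\alpha (M - N) H(Y) + \sum_{i=1}^{N} (f_i + \alpha) H(Y|\mathbf{X} = \mathbf{x}_i)}{\alpha M + s}, \tag{5}$$

where $M$ is the number of possible instances of the feature vector
$\mathbf{X}$, $N$ is the number of observed instances (so the number of
non-observed instances is $M - N$), $f_i$ is the frequency (number of
observations) of the instance $\mathbf{x}_i$ and $s$ is the number of
samples.

The CoD becomes:

$$CoD_Y(\mathbf{X}) = \frac{\varepsilon_Y - \dfrac{\alpha (M - N) \varepsilon_Y + \sum_{i=1}^{N} (f_i + \alpha)\left(1 - \max_{y \in Y} P(y|\mathbf{x}_i)\right)}{\alpha M + s}}{\varepsilon_Y}. \tag{6}$$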
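A small numerical sketch may help to see how this penalization behaves. It follows the description in this subsection (every possible instance receives an extra mass alpha, and non-observed instances contribute the entropy of Y or the prior error); the function names, the parameter `m` for the number of possible instances, and the base-2 logarithm are assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import log2
from typing import Dict, Hashable, Iterable, Sequence


def entropy(counts: Iterable[int]) -> float:
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)


def group(xs: Sequence[Sequence[Hashable]],
          ys: Sequence[Hashable]) -> Dict[tuple, Counter]:
    """Class counts for each observed instance."""
    table: Dict[tuple, Counter] = defaultdict(Counter)
    for x, y in zip(xs, ys):
        table[tuple(x)][y] += 1
    return table


def penalized_entropy(xs, ys, m: int, alpha: float) -> float:
    """Mean conditional entropy with mass alpha on every possible instance."""
    s, table = len(ys), group(xs, ys)
    n = len(table)  # number of observed instances
    total = alpha * (m - n) * entropy(Counter(ys).values())
    for counts in table.values():
        total += (sum(counts.values()) + alpha) * entropy(counts.values())
    return total / (alpha * m + s)


def penalized_cod(xs, ys, m: int, alpha: float) -> float:
    """CoD with the same penalization; non-observed instances get the prior error."""
    s, table = len(ys), group(xs, ys)
    n = len(table)
    prior_error = 1 - max(Counter(ys).values()) / s
    mass = alpha * (m - n) * prior_error
    for counts in table.values():
        f = sum(counts.values())
        mass += (f + alpha) * (1 - max(counts.values()) / f)
    return (prior_error - mass / (alpha * m + s)) / prior_error


# Toy data (an assumption): one feature with 4 possible values, 2 of them observed.
xs = [(0,), (0,), (1,), (1,)]
ys = [0, 0, 1, 1]
print(penalized_entropy(xs, ys, m=4, alpha=1.0))
print(penalized_cod(xs, ys, m=4, alpha=1.0))
```

With `alpha = 0` both functions reduce to the unpenalized estimates; as `m` grows relative to the number of observed instances, the criterion is pulled toward the value obtained without any observation, which discourages subsets that are too large for the available samples.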

3.4 Penalization of rarely observed instances 

In this penalization, the non-observed instances are not taken into
account. It consists of changing the conditional probability distribution
of the instances that have only one observation [10]. This makes sense
because, if an instance $\mathbf{x}$ has only one observation, the value of
$Y$ is fully determined ($H(Y|\mathbf{X} = \mathbf{x}) = 0$ and
$CoD_Y(\mathbf{X}) = 1$), but the confidence about the real distribution
$P(Y|\mathbf{X} = \mathbf{x})$ is very low. A parameter $\beta$ gives a
confidence value that $Y = y$. The main idea is to attribute $\beta$ to
$P(Y = y|\mathbf{X} = \mathbf{x})$ and to distribute $1 - \beta$ equally
over all $P(Y \neq y|\mathbf{X} = \mathbf{x})$. In Barrera et al. [10], the
$\beta$ value is $\frac{1}{|Y|}$, where $|Y|$ is the number of classes
(cardinality of $Y$), which yields the uniform distribution (strongest
penalization).

Adapting this penalization to Equation 3, the mean conditional entropy
becomes:

$$H(Y|\mathbf{X}) = \frac{M - N}{s} H(F(Y)) + \sum_{\mathbf{x} \in \mathbf{X} : P(\mathbf{x}) > \frac{1}{s}} P(\mathbf{x}) H(Y|\mathbf{x}), \tag{7}$$

where $F(Y)$ is the probability distribution given by

$$F(i) = \begin{cases} \beta, & \text{if } i = 1 \\ \dfrac{1 - \beta}{c - 1}, & \text{if } i = 2, 3, \ldots, c \end{cases}$$

and $N$ in this case is the number of instances $\mathbf{x}$ with
$P(\mathbf{x}) > \frac{1}{s}$ (more than one observation).
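Since $\varepsilon_Y(\mathbf{x}) = 1 - \beta$ when
$P(\mathbf{x}) = \frac{1}{s}$, the CoD with this penalization is given by:

$$CoD_Y(\mathbf{X}) = \frac{\varepsilon_Y - \left(1 - \dfrac{\beta (M - N)}{s} - \sum_{\mathbf{x} \in \mathbf{X} : P(\mathbf{x}) > \frac{1}{s}} P(\mathbf{x}) \max_{y \in Y} P(y|\mathbf{x})\right)}{\varepsilon_Y}. \tag{8}$$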
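The effect of this penalization can be illustrated directly from its verbal description: an instance observed exactly once contributes the entropy of the distribution that puts beta on the observed class and spreads 1 - beta over the others (or an error of 1 - beta for the CoD), while instances observed more than once contribute as usual. The sketch below is an example under assumptions; the function names, the `n_classes` parameter (which must be at least 2) and the base-2 logarithm are not taken from the software.

```python
from collections import Counter, defaultdict
from math import log2
from typing import Dict, Hashable, Iterable, Sequence


def entropy(counts: Iterable[int]) -> float:
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)


def group(xs: Sequence[Sequence[Hashable]],
          ys: Sequence[Hashable]) -> Dict[tuple, Counter]:
    table: Dict[tuple, Counter] = defaultdict(Counter)
    for x, y in zip(xs, ys):
        table[tuple(x)][y] += 1
    return table


def rare_penalized_entropy(xs, ys, beta: float, n_classes: int) -> float:
    """Mean conditional entropy penalizing instances observed only once."""
    s = len(ys)
    # Distribution assigned to a singly observed instance.
    f_dist = [beta] + [(1 - beta) / (n_classes - 1)] * (n_classes - 1)
    h_f = -sum(p * log2(p) for p in f_dist if p > 0)
    total = 0.0
    for counts in group(xs, ys).values():
        f = sum(counts.values())
        total += h_f / s if f == 1 else f / s * entropy(counts.values())
    return total


def rare_penalized_cod(xs, ys, beta: float) -> float:
    """CoD penalizing instances observed only once (error 1 - beta each)."""
    s = len(ys)
    prior_error = 1 - max(Counter(ys).values()) / s
    error = 0.0
    for counts in group(xs, ys).values():
        f = sum(counts.values())
        error += (1 - beta) / s if f == 1 else f / s * (1 - max(counts.values()) / f)
    return (prior_error - error) / prior_error


# Toy data (an assumption): instance (0,) is seen once, instance (1,) twice.
xs = [(0,), (1,), (1,)]
ys = [0, 1, 1]
print(rare_penalized_entropy(xs, ys, beta=0.5, n_classes=2))
print(rare_penalized_cod(xs, ys, beta=0.5))
```

With `beta = 1` singly observed instances are fully trusted and both functions give the unpenalized estimates; with `beta = 1 / n_classes` such instances are treated as carrying no information about the class.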

3.5 Classifier design and generalization 

After feature selection using $H$ or CoD, the classifier is designed from
the table of conditional probabilities, in which each row is a possible
instance $\mathbf{x} \in \mathbf{X}$, each column is a possible class
$Y = y$ and each cell holds $P(Y = y|\mathbf{X} = \mathbf{x})$. This table
is used as a Bayesian classifier: for each given instance, the chosen label
$Y = y$ is the one with maximum conditional probability for that instance.
Instances that have two or more labels of maximum probability (including
non-observed instances) can be generalized according to some criterion. A
commonly used criterion is nearest neighbors with some distance metric [1].
We implemented nearest neighbors using the Euclidean distance. In this
implementation, the nearest neighbors are taken successively, and the
occurrences of each label are summed until exactly one label has the
maximum number of occurrences; that label is chosen as the class of the
considered instance. This feature can be turned off. In that case, the
label is guessed, i.e. chosen randomly among the labels with the maximum
number of occurrences (including non-observed instances).
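One possible reading of this generalization rule is sketched below. It is not the software's code: the names `classify` and `unique_best` are hypothetical, the table stores label counts rather than probabilities, neighbors are the observed instances sorted by Euclidean distance, and ties in distance are broken by the sort order, which is a simplification.

```python
from collections import Counter
from math import dist
from typing import Dict, Hashable, Optional, Tuple

Instance = Tuple[float, ...]


def unique_best(counts: Counter) -> Optional[Hashable]:
    """Return the label with strictly the most occurrences, or None on a tie."""
    ranked = counts.most_common(2)
    if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
        return ranked[0][0]
    return None


def classify(x: Instance, table: Dict[Instance, Counter]) -> Optional[Hashable]:
    """Label x from the table, adding nearest neighbors until the tie breaks."""
    counts = Counter(table.get(x, {}))  # empty for a non-observed instance
    label = unique_best(counts)
    if label is not None:
        return label
    neighbors = sorted((z for z in table if z != x), key=lambda z: dist(x, z))
    for z in neighbors:
        counts.update(table[z])  # sum the label occurrences of the next neighbor
        label = unique_best(counts)
        if label is not None:
            return label
    return None  # still tied after using every observed instance


# Toy table of label counts per observed instance (an assumption for the example).
table = {
    (0, 0): Counter({"a": 3}),
    (0, 1): Counter({"a": 1, "b": 1}),
    (3, 3): Counter({"b": 4}),
}
print(classify((0, 0), table), classify((0, 1), table), classify((3, 2), table))
```

Here `(0, 1)` is tied in the table and is resolved by its nearest neighbor `(0, 0)`, while the non-observed instance `(3, 2)` takes the label of its nearest observed instance.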



4 Software description 

The software is implemented in Java in order to be executable on different
platforms. It is open source and intended to be continuously developed in a
world-wide collaboration. The software is available at
http://dimreduction.incubadora.fapesp.br/.

There are four main panels. The first panel allows the user to load the
data set (Figure 1-a). The second is optional and lets the user define a
quantization degree for the data set. The quantized data may be visualized
(Figure 1-b). It is worth noting that some feature selection criteria, such
as mean conditional entropy or CoD, require the data to be quantized to
discrete values, which explains the quantization step available in the
software. The data quantization is based on a common rule: it searches for
the extreme values (positive and negative) and divides the negative and the
positive ranges equally, according to the number of divisions specified by
the quantization degree parameter.

Figure 1: Application panels. Panel (c): single execution.

The next step can be either single execution or cross-validation. The first
is dedicated to performing single tests (Figure 1-c). It is represented by
a panel where the user can enter input parameters such as the feature
selection algorithm (see Section 2 for the algorithms implemented) and the
criterion function (see Section 3 for the criteria implemented). Other
implemented utilities, including the visualization of the feature selection
results, are found in the middle of the panel. There are three ways to
visualize the results: graphs (Figure 4), scatterplot (Figure 2-a) and
parallel coordinates (Figure 2-b). The graphs show the connections among
different classes, chosen in the feature selection execution, as directed
edges between selected vertices. The parallel coordinates proposed by [11]
allow similar patterns of behavior in the data to be visualized on adjacent
axes (selected features), visually indicating how well separated the
classes are with respect to the adjacent features. In the software, the
features and their order used to build the parallel coordinates chart are
defined by the user.

Figure 2: Examples of scatterplot and parallel coordinates generated by the
software.

The cross-validation panel (Figure 1-d) is very similar to the previous
one. Cross-validation [12] consists of dividing the whole data set into two
mutually exclusive subsets, training and test, whose sizes can be defined
by the user. The training set is the input to the feature selection
algorithm. The classifier designed from the selected features and the table
of joint probability distributions then labels the test set samples. At the
end of the cross-validation process, a chart is plotted with the results of
each execution, which makes it possible to visualize the hit rate and its
variation across executions.

Another available option is the generalization of non-observed instances.
With this option selected, the instances of the selected feature set that
are not present in the training samples are generalized by a nearest
neighbors method [1] with Euclidean distance (see Section 3.5 for more
details). This method is also applied to decide among classes with tied
maximum conditional probability given a certain instance.



5 Illustrative Results 

This section presents results in two settings. First, the software was
applied as a feature selector in a biological classification problem:
classifying breast cancer cells into two possible classes, benign and
malignant. The biological data used here were obtained from [13] and have
589 instances and 32 features. The results shown in Figure 3 present very
low variation and highly accurate classification, achieving 99.96%
accuracy on average.

Figure 3: Cross-validation results using 10 executions, 80% of data as
training set and 20% as test set.



The second computational biology problem addressed was gene network
recovery. In this case we used an artificial gene network generated by the
approach presented in [T3]. The parameters used were: 10 nodes, binary
quantization, 20 observations (timestamps), an average of 1 edge per vertex
and Erdős-Rényi random graphs as the network architecture. Figure 4
presents the recovered network. This result presented no false negatives
and just a few false positives.

6 Conclusion

The proposed feature selection environment allows data analysis using sev- 
eral algorithms, criterion functions and graphic visualization tools. Since 
it is an open-source and multi-platform software, it is suitable for the user 
that wants to analyze data and draw some conclusions about it, as well as 
for the specialist that has as objective to compare several combinations of 
approaches and parameters for each specific data set or to include more fea- 
tures in the software such as a new algorithm or a new criterion function. 
This system can evolve and include feature extraction methods as well, not 
limited only to feature selection methods. 

The environment can be used in many pattern recognition applications, 
although the main concern is with Bioinformatics tasks, especially those 
involving high-dimensional data (large number of genes, for example) with 
small number of samples. Even users not familiar with programming are 
allowed to manipulate the software in an easy way, just by clicking to select 
file inputs, quantization, algorithms, criterion functions, error estimation 
methods and visualization of the results. The environment is implemented 
as "wizard style", i.e., it has tabs delimiting each procedure. 

This software opens up considerable room for future work. The next steps
consist of implementing other classical feature selection algorithms
(e.g. GSFS and PTA [TJ US]), criterion functions (e.g. based on distances
between classes [1]) and error estimation methods (e.g. leave-one-out and
bootstrap), and then including classical feature extraction methods
(e.g. PCA hei).